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Abstract 

We present a flexible Alternating Direction Method of Multipliers (F-ADMM) algorithm for solving 
optimization problems involving a strongly convex objective function that is separable into n >2 blocks, 
subject to (non-separable) linear equality constraints. The F-ADMM algorithm uses a Gauss-Seidel 
scheme to update blocks of variables, and a regularization term is added to each of the subproblems 
arising within F-ADMM. We prove, under common assumptions, that F-ADMM is globally convergent. 

We also present a special case of F-ADMM that is partially parallelizable, which makes it attractive 
in a big data setting. In particular, we partition the data into groups, so that each group consists of 
multiple blocks of variables. By applying F-ADMM to this partitioning of the data, and using a specific 
regularization matrix, we obtain a hybrid ADMM (H-ADMM) algorithm: the grouped data is updated in 
a Gauss-Seidel fashion, and the blocks within each group are updated in a Jacobi manner. Convergence 
of H-ADMM follows directly from the convergence properties of F-ADMM. Also, a special case of H- 
ADMM can be applied to functions that are convex, rather than strongly convex. We present numerical 
experiments to demonstrate the practical advantages of this algorithm. 
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1 Introduction 

In this work we study the optimization problem 

n 

(la) 

n 

'^A^Xi = b, (lb) 

where, for each i = 1,..., n, the function fi : —>• R U {oo} is strongly convex, closed, and extended real 

valued, and the vector b G R™ and matrix Ai G RrnxAfi represent problem data. Note that the objective 
function (lla|l is separable in the decision vectors Xi,... ,Xm but that the linear constraint m links them 
together, which makes problem m non-separable overall. 

We can think of the decision vectors {xi} as “blocks” of a single decision vector x G R'^, where N = 
Dr=i This can be achieved by partitioning the N x N identity matrix I column-wise into n submatrices 
{Ui G so that / = [Ui ,..., [/„], and then setting x = X)r=i UiXi. That is, x is the vector formed 

by stacking the vectors loP ol ®^ch other. It is easy to see that Xi = Uf x G R^y and that if we 

let A := X)r=i S then (llbD is equivalent to Ax = b. Note that Ai = AUi for j = 1, 2,..., n, 

and that we can write A = [Ai,..., A„]. If we now let f(x) := problem ([T]) is equivalent to 

minimize /(x) (2a) 

subject to Ax = b. (2b) 

Although problems H]) and ([2]) are mathematically equivalent, it is important to note that the best algorithms 
for solving them take advantage of the block structure that is made explicit in formulation O- 


minimize 

subject to 


1 


1.1 Relevant Previous Work 


Many popular algorithms for solving ([T]) (equivalently, for solving ([2])) are based on the Augmented La- 
grangian function. In the remainder of this section, we describe several such algorithms that are closely 
related to our proposed framework. 

The Augmented Lagrangian Method of Multipliers (ALMM) 

The ALMM (e.g., see m) is based on the augmented Lagrangian function 

y) ■■= f{x) - {y, Ax-b) + ^\\Ax- &||i, (3) 

where p > 0 is called the penalty parameter, y € R™ is a dual vector that estimates a Lagrange multiplier 
vector, and {p,q) = p^q is the standard inner product in R". The most basic variant of ALMM (see 
Algorithm [T]) , involves two key steps during each iteration. First, for a fixed dual estimate, the augmented 
Lagrangian (|3]) is minimized with respect to the primal vector x. Second, using the minimizer computed in the 
first step, a simple update is made to the dual vector that is equivalent to a dual ascent step for maximizing 
an associated dual function. In practice, computing the minimizer in the first step is the computational 
bottleneck. This is especially true for large-scale problems that arise in big data applications, and therefore 
extensive research has focused on reducing its cost (e.g., decomposition techniques dn m m]). 


Algorithm 1 A basic variant of ALMM for solving problem ([2]). 

1: Initialization: S R"*, iteration counter fc = 0, and penalty parameter p > 0. 

2: while the stopping condition has not been met do 

3: Update the primal variables by minimizing the augmented Lagrangian: 



a;U+i) ^ argmin Cp{x;y^’^'>) 

X 

(4) 

4: Update the dual variables: 

y{k+l) ^ y{k) _ p(^a.(fc+l) _ b) 


5: Set k k + 1. 

6: end while 




Although sophisticated variants of ALMM are successfully used in many important application areas 
(e.g., optimal control in natural gas networks m), generally they are unable to directly take advantage of 
the block separability described in formulation ([T|), when it exists. Nonetheless, ALMM serves as the basis 
for many related and powerful methods, as we now discuss. 

The Alternating Direction Method of Multipliers (ADMM) 

The ADMM has been a widely used algorithm for solving problems of the form ([T} when n = 2, for convex 
functions. Global convergence of ADMM was established in the early 1990’s by Eckstein and Bertsekas m 
while studying the algorithm as a particular instance of a Douglas-Rachford splitting method. This rela¬ 
tionship allowed them to use monotone operator theory to obtain their global convergence guarantees. (An 
introduction to ADMM and its convergence theory can be found in the tutorial style paper by Eckstein [9] . 
See also [4].) Pseudocode for ADMM when n = 2 is given below as Algorithmic 
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Algorithm 2 ADMM for solving problem ([1]) when n = 2. 

1: Initialization: x (0) e R^, e R"", iteration counter fc = 0, and penalty parameter p > 0. 
2: while the stopping condition has not been met do 
3: Update the primal variables in a Gauss-Seidel fashion; 



^(fc-i-i) ^ argmin £p(x,X2^^;y^^^) 

X 

(5a) 


Xn ^ argmin ’) 

X 

(5b) 

4: Update the dual variables: 

y(k+l) ^ y{k) _ p( 24 a;(fc+i) _ h) 


5; Set fc ^ /c -I-1. 

6; end while 




In words, ADMM works as follows. At iteration k, for a fixed multiplier and fixed block , the 
new point is defined as the minimizer (for simplicity, we assume throughout that this minimizer exists 

and that it is unique) of the augmented Lagrangian with respect to the first block of variables xi. Then, 
in a similar fashion, the first (updated) block is fixed, and the augmented Lagrangian is minimized 

with respect to the second block of variables X 2 to obtain X 2 ^ • Finally, the dual variables are updated in 
the same manner as for the basic ALMM (see Algorithm [1]) , and the process is repeated. Notice that a key 
feature of ADMM is that the blocks of variables Xi and X 2 are updated in a Gauss-Seidel fashion, i.e., the 
updated values for the first block of variables are used to define the subproblem used to obtain the updated 
values for the second block of variables. The motivation for the design of ADMM is that each subproblem 
(see ((5a1) and (IFbl) ') should be substantially easier to solve than the subproblem (see ([1])) used by ALMM. 
For many important applications, this is indeed the case. 

The interest in ADMM has exploded in recent years because of applications in signal and image process¬ 
ing, compressed sensing m, matrix completion |22] . distributed optimization and statistical and machine 
learning [3], and quadratic and linear programming [3]. Convergence of ADMM has even been studied for 
specific instances of nonconvex functions, namely consensus and sharing problems El- 

A natural question to ask is whether ADMM converges when there are more than two blocks, i.e., when 
n > 3. The authors in show via a counterexample that ADMM is not necessarily convergent if n = 3. 
However, they also show that if n = 3 and at least two of the matrices that define the linking constraint 
m are orthogonal, then ADMM will converge. In a different paper [5], the authors show that ADMM will 
converge when n = 3 if at least one of the functions /i in (llap is strongly convex. 

Other works have considered the more general case of n > 2. For example, an ADMM-type algorithm 
for n > 2 blocks is introduced in [2^ , where during each iteration a randomly selected subset of blocks is 
updated in parallel. The method incorporates a “backward step” on the dual update to ensure convergence. 
Hong and Luo m present a convergence proof for the n block ADMM when the functions are convex, but 
under many assumptions that are difficult to verify in practice. Work in m shows that ADMM is convergent 
in the n block case when the functions fi for i = 1 ,..., n are strongly convex. 

The Generalized ADMM (G-ADMM) 

Deng and Yin [8] introduced G-ADMM, which is a variant of ADMM for solving problems of the form ([T]) 
when n = 2 and the functions fi are convex. They proposed the addition of a (general) regularization 
term to the augmented Lagrangian function during the minimization subproblem within ADMM and the 
addition of a relaxation parameter 7 to the dual variable update. Their motivation for the inclusion of a 
regularization term was twofold. First, for certain applications, a careful choice of that regularizer lead to 
subproblems that were significantly easier to solve. Second, the regularization stabilized the iterates, which 
has theoretical and numerical advantages. 
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Their method is stated below as Algorithm [31 It uses, for any symmetric positive-definite matrix M and 
vector z, the ellipsoidal norm 

WzWh := z^Mz. ( 6 ) 


Algorithm 3 G-ADMM for solving problem o when n = 2. 

1 : Initialization: a;® S G G R™, iteration counter k = 0, parameters p > 0 and 

7 G (0,2), and regularization matrices Pi G and P 2 G R'^ 2 xAf 2 ^ 

2 : while the stopping condition has not been met do 
3: Update the primal variables in a Gauss-Seidel fashion; 


Jk+i) 

JLl 

Jk+i) 

J -2 


argmm Cp{x,xf"’+ \\\x - 
argmm Cp{x^i^^\x]y^^'^) + i||a; - 


(7a) 

(7b) 


4 : Update the dual variables: 

y(fc+l) ^ y(fc) _ ^p{Ax^k+l) _ 

5: Set k k + 1. 

6; end while 


The authors prove [5] that Algorithm [3] converges to a solution from an arbitrary starting point as long 
as the regularization matrices Pi and P 2 in (ITal) and dig satisfy certain properties. We stress that the 
convergence analysis for G-ADMM only applies to the n = 2 case. 


The Jacobi ADMM (J-ADMM) 

Deng et al. [7] have extended the ideas first presented in G-ADMM [ 8 ]. Their new J-ADMM strategy (stated 
below as Algorithm 0]) may be used to solve problem ([T]) in the general case of n > 2 blocks. Note that (ISall 
is equivalent to the update 


„(fc+i) 


^ argmin £p(; 


Jfc) 




1 , Xj, , • ■ 


(fc) 


,xW;yW) + i||a;.- 


„(Di 


2 

Pi’ 


(where Pi G is a regularization matrix) which we state in order to highlight the relationship of their 

method to the previous ones. We also comment that the form of the update used in (l 8 al) motivates why 
their algorithm is of the proximal type. 


Algorithm 4 J-ADMM for solving problem CD for n > 2. 

1 : Initialize: G R^, G R"*, iteration counter k = 0, parameters p > 0 and 7 G (0,2), and 

regularization matrices Pi G for i = 1,..., n. 

2: while stopping condition has not been met do 
3: for i = 1,..., n (in parallel) do 

^ (fc) -| 

^ argniin|/,(a;i) -f ^\\A,x,+ '^Ajxf'’ - b-—\\l + -\\xi - xf"'’fp.^ ( 8 a) 

4; end for 

5; Update the dual variables: 

yik+i) yik) _ ^p(^Ax^k+i) _ 

6 : Set k k + 1. 

7; end while 
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In [7], the authors establish global convergence of J-ADMM for appropriately chosen regularization 
matrices Pi. Moreover, they showed that J-ADMM has a convergence rate of o(l/k). 

1.2 Our Main Contributions 

We now summarize the main contributions of this work. 

1. We present a new flexible ADMM algorithm, called F-ADMM, that solves problems of the form ([T|) 
for strongly convex fi, for general n > 2 based on a Gauss-Seidel updating scheme. The quadratic 
regularizer used in F-ADMM is a user defined matrix that must be sufficiently positive definite (see As¬ 
sumption 2]) , which makes F-ADMM flexible. For some applications, a careful choice of the regularizer 
makes the subproblems arising within F-ADMM significantly easier to solve, e.g., see the discussion in 
[HI Section 1.2] and [T] Section 1.2]. We prove that F-ADMM is globally convergent in SectionjHl 

2. We introduce a hybrid Jacobi/Gauss-Seidel variant of F-ADMM, called H-ADMM, that is partially 
parallelizable. This is significant because it makes H-ADMM competitive in a big data setting. For 
H-ADMM, the blocks of variables are gathered into multiple groups, with a Gauss-Seidel updating 
scheme between groups, and a Jacobi updating scheme on the individual blocks within each group. We 
demonstrate that H-ADMM is simply F-ADMM with a particular choice of regularization matrix, and 
thus the convergence of H-ADMM follows directly from the convergence proof for F-ADMM. 

3. We show that if the n blocks of data are partitioned into two groups, then H-ADMM can be applied to 
convex functions fi, rather than strongly convex functions. In this special case, with carefully chosen 
regularization matrices, H-ADMM extends the algorithm in |8] from the n = 2 case, to the case with 
general n, and convergence follows directly from the results presented in [5]. 

1.3 Paper Outline 

In Section 2] we present our new flexible ADMM framework and show that any instance of it is globally 
convergent. In Section [3] we consider a particular instance of our general framework, and proceed to show 
that it is a hybrid of Jacobi- and Gauss-Seidel-type updates. We also discuss the practical advantages of this 
hybrid algorithm, which includes the fact that it is partially parallelizable. Finally, in Section 2] we present 
numerical experiments that illustrate the advantages of our flexible ADMM framework. 


2 A Flexible ADMM (F-ADMM) 


In this section we present and analyze a new F-ADMM framework for solving problems of the form (|T]). For 
convenience, we define the vector 


:= 


V(fc)' 

U'J 


(9) 


Our analysis requires several assumptions concerning problem ([I} that are assumed to hold throughout. The 
first of which uses df{x) to denote the subdifferential of / at the point x, i.e.. 


df{x) := {s G I {s,w — x) < f{w) — f{x), Vw G dom/}. 


( 10 ) 


where dom/ = {x : f{x) < c»}. Moreover, 

dfi{xi) := {s, G I {si,Wi - Xi) < f,{w,) - fi{xi), Vw, G dom/,}. (11) 

We also require the following definition of strong convexity. A function fi : R U {-l-oo} is strongly 

convex with convexity parameter > 0 if for all Xi,Wi G dom/,, 

fi{wi) > fi{x,) + {df,{xi),Wi - Xi) + y Ik* - a:*||2- (12) 

We may now state our assumptions on problem O- 
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Assumption 1. The set of saddle points (equivalently, the set of KKT-points) for ([T]) is nonempty, i.e., 

U* := {u* G : u* = {x*,y*), A^y* G df{x*), and Ax* -b = t)}^ 0. 

Assumption 2. The function fi is strongly convex with strong convexity constant p,i > 0 for i = 1,... ,n. 

If Assumption [T] does not hold, then ADMM may have unsolvable or unbounded subproblems, or the 
sequence of Lagrange multiplier estimates may diverge. In particular, x* is the solution to (HD and y* is a 
solution to the associated dual problem. Assumption [5] allows us to define 

yt := min yti > 0 (13) 

l<2<n 


as the minimum strong convexity parameter for the functions as well as use the following lemma. 

Lemma 3 (Strong monotonicity of the subdifferential. Theorem 12.53 and Exercise 12.59 in [17]). Under 
Assumption\^ for any Xi,Wi G dom.fi we have 

(si-U,x^-w^) > yti\\x^-w^\\l, 'isiedfi{xi), Uedfi{wi), i = l,...,n. (14) 


The following matrices will be important for defining the regularization matrices used in our algorithm, 
and will also be used in our convergence proof. In particular, we define the block diagonal matrix Ajo, and 
the strictly upper triangular matrix Aa as 


A/\ 


-1 

to 

s 

_1 


'Ai 


and Ajo := 


A 





A„_ 


where {Aa, Ai)} C We then have the strictly (block) upper triangular matrix 


A^Aa = 


AfA2 ... AfA„ 


A^_iA„ 


G R 


NxN 


(15) 


(16) 


Notice that A^Aa is equivalent to triu^(A^A), where triu'''(A) denotes the strictly upper (block) triangular 
part of X. We are now in a position to describe the details of our F-ADMM method. 

2.1 The Algorithm 

Our F-ADMM method is stated formally as Algorithm [Sj As for J-ADMM, F-ADMM requires the choice 
of a penalty parameter p > 0 and regularization matrices {Pi}(Ui. Our convergence analysis considered in 
Section [2^ requires them to satisfy the following assumption that uses the definition of p. in (|13l) . 

Assumption 4. The matrices Pi are symmetric and satisfy Pi >- I^IIA^AaIIII for all i = 1,... ,n. 


6 








Algorithm 5 F-ADMM for solving problem ([T]). 

1: Initialize: ^(0) g yiO) g 

iteration counter fc = 0 , parameters p > 0 , 7 € ( 0 , 2 ), and matrices 

{Pi}7=i satisfying Assumption SI 
2: while stopping condition has not been met do 
3: Update the primal variables in a Gauss-Seidel fashion: 


Jfc+i) 


(^) 1 

argmin|/i(xi) + 11| AiXi + ^ ^ - b - ^\\l +-^\\xi - xf'' 


i =2 




2-1 


(^) 1 

arginin|/j(x*) + |||AiX,+ ^ ^ Aixf^ -b --—II 2 + 2 ' 




i=i 


Z=2+l 


n—1 




argrnin |/„(x„) + |||A„Xn + ^ - h - 


1,2 1 


i=i 


+ -||x„ - xl''^||p_ I 


.12 "C r, ‘*'n IIP, 

P 2 


4 : Update the dual variables: 

y(fc+l) ^ y{k) _ ^p(yl^(fe+l) _ b) (17) 

5: Set k ^ k + \. 

6; end while 


We now describe the A:th iteration of Algorithm [5] in more detail. For fixed dual vector , the cur¬ 
rent point x^*) is updated in a Gauss-Seidel (i.e., a cyclic block-wise) fashion. To begin, decision vectors 
X 2 ^\... ,xi^^ are fixed, and the hrst subproblem in Step [3] is minimized with respect to xi to give the new 
point Similar to before, we note that the Ah subproblem in Step [3] is equivalent to 


(fc-i-i) . ^ / 

x) ^ argmm Up (a 


,(fc+i) 


„(fe+i) 


(fc) 


)2:i,Xjg2) 


,xW;yW) + i||xi- 


„(fc)i 


(18) 


Next, the second block X 2 is updated using the information obtained in the update of the first block xi. 
That is, the vectors Xg^^, •. •, x^'^ remain fixed, as does , and the regularized augmented Lagrangian is 

minimized with respect to X 2 to give the new point X 2 . The process is repeated until all n blocks have 
been updated, giving the vector . Finally, the dual vector is updated using the same formula as 

in J-ADMM (see Algorithm |3|) . Steps |3] and 0] are repeated until a stopping threshold has been reached. 

Remark 5. It is clear that Algorithm\^uses a (serial) cyclic block coordinate descent (CD) type method to 
update the primal vector x. That is, in Step\^ of Algorithmic a single pass of block CD is applied to the 
current point x^^'l to give the new point and then the dual vector is updated. 


2.2 Convergence 

To analyze F-ADMM, we require the block diagonal matrices G^, and G defined as 


G, := 

'Pi 

and G := 

G, 


Pn. 


L 7P J 


(19) 


where / is the (appropriately sized) identity matrix, and 7 G (0, 2) and p > 0 are algorithm parameters. The 
following result gives sufficient conditions for declaring that a limit point of problem m is optimal. 
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( 20 ) 


Lemma 6. If 1C is any subsequence of the natural numbers satisfying 

limwW=u^ and lim ||g = 0 

kGK fce/c 

for some limit point , then S U*, i.e., solves problem m- 
Proof. Let us first observe that the two limits in ([201) jointly imply that 

r,L\ 


lim =u^ = 


keK 


y 


( 21 ) 


Also, it follows from (EUl) and the definitions of (see dH])) and G (see ((HI)), that = 0. 

Combining this with (fT71) . (1^ . and (1^ establishes that 


b = lim = Ax^ = lim 

k&K k&K 


( 22 ) 


so that, in particular, x^ is feasible for problem (HD- 

Next, the optimality condition for the ith subproblem in Step [3] of Algorithm [5] ensures the existence of 
a vector gi{x^^'^^'^) £ dfi(xp^^) satisfying 


i-l 






i=i 


Z=2+l 


t —1 n 

= udA*") - Aly^“> + pAr(A„r'> + + E '»'-!'> -1-) + riA*" - »!*’) 

J = 1 l=i+l 

= mA*") - + ± a,A> - /ix‘) + p,(x<*+‘> - 4% 


i=i 


l=i+l 


where we also used (1^ to substitute for b in the last equation. Using Ax^ = X)j=i rearranging 

the previous equation gives 


I n 

giixp^'’) =Afy<'^^ - pAf -^j)+ Mx[''^ -xf)J -P*(xp+^^ -a;f^)- 


1=1 l=i+l 

By taking limits over the snbsequence 1C of the previous equation, and nsing (1201) and dUD, we know that 

(23) 


l^xag^{xp^'’) =Ajy^. 

k^K 


We may then use gi{xp^'^) £ dfi{xp^''), (1^ . (l23l) . and [HI Theorem 24.4] to conclude that 

Ajy^ £ 9/*(xf). 

Combining this inclusion, which holds for all 1 < 1 < n, with (1221) shows that is a KKT point for problem 
(HD, and thus is a solution as claimed. □ 

Our aim is to combine Lemma M with the next result, which shows that the sequence {||Mfe — m*||g} is 
nonexpansive with respect to any u* £ U*. We note that the proof is inspired by that for J-ADMM [7]. 

Theorem 7. Let Assumptions^ an d\^ hold. Then, for any u* £ U* and all k > 1, there exists a constant 
77 > 0 such that 

||„W -^*11^ - -11*11^ > (24) 

with defined in ((9D and G defined in HID- 


Proof. At each iteration of Algorithm [SJ a subproblem of the following form is solved for Xi: 


Jfe+i) 


i=i 


i=i+l 


(^) 1 

argmin {fi{x,) + ^ Aix['"'’- b - ^\\l +-\\xi - xl'"^\\p,y (25) 


The first order optimality condition for (1251) is 


,(fe) 


oea/.(xr‘>)+Mni:. 4 A“’+ 1 : + p.(,(‘+»_,<*)). 

,t=l l=i+l ^ 


and rearranging gives 

( (k) n \ 

i+y E /lix™ +p.(x<‘>-xr‘’)ea/.(xr'>r 

^ i=i I=i+1 J 

Noting that J27=i+i = — Yl]=i defining y := — b) gives 

Afy-pAj ( A,(a:f- xf+'’) S +')). 


U=i+1 


Using Lemma [3] with Afy* G dfi {x *), we have 


-^:iii < - -i^(y - r) - paj E ) 

\ t=i+i / 

Now, summing the previous inequality over all blocks i gives 

n n 

/r||a;('=+i) - < {a{x^>^+^) - x*),y- V*) - " <)> E 


j=i+l 


+^(x;'-«>-x-)’-p.(x<«-x<‘«'), 


2=1 

where fi > Q is defined in (USD. Notice that, by (II3 the following relation holds 

A(x('=+i)-x*) = ^(y«-y('=+i)), 

and we also have that 

y-y* = {y- y^’^^Y + - vl = ^( 2 /^"^ - 22 ^'=+'’) + ( 2 /^'=+'^ - 22 *)- 
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Then (1^ becomes 


(26) 


(27) 


(28) 




- x*YP,{xf^ - 

A*")-pT.( 


1 

* 

M 


2=1 


2=1 


j=i+l 

llTj 

> 


x*\\l--{y^^'> 

-y('=+i),y-y*) 





IP 


- ’■/•'I + V 


l 28 ll 


x*\\l--{y^^^ 

IP 

22('=+i),22('=+i)- 

-h^'^^-y' 

p 


(29) 
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Using the identity 


-X*), ^ Aj{. 
i—1 j=i-\-l 




(30) 


we may deduce from (1^^ and the definition of Gx that 

71 

i^l 

= - X*, - x('=+i))) 

1 

IP 


> /.||x(''+l) - x*||2 - —(y('=) - ^(^=+1), - y*) 

IP 

■ 1 - T' ||yW _ y(fc+i) ||2 + p(^^(k+i) _ - x('=+i))). 


2 

IP 

Then, by rearranging we have 

(y(fe+i) -u*)G{u^'^^ > /x||x('=+^) -a^*lli + 

Yp 

+ p(Jx('=+i) - x*,ylSAA(x('=) - x('=+i))^, 
where G is defined in (1191) . Combining the relation 
II^W _ u*\\l - - u*\\g = 

with (l32l) gives 

,(C _ „*l|2 _ ||,,(fc+l) _ „*||2 


(31) 


(32) 


u''-' -u-\\g - I|U'-' - W|Ig > -a:1li + 2-^||y('=) - 111 + - w^'^^^^IIg 


. 1 -7 i 

Yp 

+ 2p^x('=+i) -x*,A|;24a(x('=) -x('=+i))^ 

= 2m||x('=+i) - x*||l + ^|| 2 /('=) - 111 + ||xW - x^'^+i)! 

IP 

+ 2p(Jx('=+i) - x*,AD71a(x('=) - x^'^+i))^ 

Notice that, because ^ > 0, the following holds: 

2p(x^^+^'^ - x*,AlAYx^'^^ - x^'^+i))^ > -2/i||x('=+i) - x*||l - ^\\AIAYx^^^ - 
Now, combining (1551) and (1551) gives 


II- 


(33) 


(34) 


-uU|2 


(fe+l) _ „*||2^ 


Q — || li ^ ^ — U 

2-7 


> 


IIj^W _yU+i)||2 + II^U) _^(fc+i)||^ _ ^pSAaIIIIIxW -x('=+i)|| 

2p 


2-7 

Yp 


ll„m _ ^ 'iA 


2=1 


(35) 


and note that Assumption |4] guarantees that 


r.:=P.-^||ASAA||lJ^0. 
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If we then let rji := XminiTi)/\\Pi \\2 > 0, we have from the definition of Ti and standard norm inequalities 


l^(fc) _ ™(fc+l) ||2 _ II (fe) _ (k+l) ||2 




^ / (fc) (fe+i)\ro/ (fc) (fe+i)\ M v 

> - x) ') P^{x\ ’ -x\ ’)= r]i\\x\ - x\ 


(^) ^('=+l)||2 


Combining this with (l3^ gives 


IlyW _y*||2 _ |i (fc+l) _y*||2 > 


2-7, 


G / - y^''^'^ll2 + E^^ll^l - 




From the previous inequality and the definition 


T] := min 


2-7 


y l<i<n 


min rii>>0 


we have 


II^W _ u*\\l - ||u('=+i) -ufG>v{ ^||y« - ^ ||xf^ - ] = 77||u« - 


G> 


i=l 


which is the desired result. 

We we may now state our main convergence result for Algorithm [5j 


□ 


Theorem 8. If the conditions of Theorem^ hold, then the sequence generated by Algorithmic 

converges to some vector that is a solution to problem O- 

Proof. Let u* be any solution in U*. It then follows from Theorem [7] that 

||y(fc) _ m*||q < — u*||g for all fc > 1, (36) 

so that is a bounded sequence. Moreover, for any integer p > 1, it follows from ([7]) that 

E < E (ll«^'^ - “IIg - 11^^'^+'^ - «1Ig) = 




/c=l 


Taking limits of both sides of the previous inequality as p —cx) shows that the sum is finite, and since all 
the summands are nonnegative that 


lim ||u('=) |1g = 0. (37) 

k—¥oo 

Next, using the boundedness of we may conclude the existence of a subsequence /C C {1, 2,... } 

and a vector G ^N+m 

1™^^^^=^^. (38) 

It follows from dMi), dazi), and Lemma |6] that ul is a solution to problem ([IJ. Finally, since (1241) held for any 
u* G U* and we have proved that G U*, it follows that limfc_).oo as desired. □ 
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3 A Hybrid ADMM (H-ADMM) 

One of the disadvantages of a Gauss-Seidel type updating scheme within ADMM is that it is inherently serial. 
With problem dimension growing ever larger in this era of big data, and the ubiquity of parallel processing 
power, a Jacobi type updating scheme may be preferable in many real-world instances of problem ([Ij. The 
purpose of this section is to show that if F-ADMM is applied to “grouped data”, and a special choice of 
regularization matrix is employed for each group, then Algorithm [S] becomes a hybrid Gauss-Seidel/Jacobi 
ADMM-type method. Therefore, Algorithm [5] is partially parallelizable. 

3.1 Notation and Assumptions 

Suppose that the function /(x) is separable into n blocks, as in (ITal) . Then, we can (implicitly) partition the 
variables Xi and functions fi{xi) together into i < n groups. For simplicity of exposition, we will assume 
that n is divisible by some p, so that £p = n, which means that we form £ groups of p blocks. Then, problem 
(HD is equivalent to the following partitioned problem: 


with 


i 


minimize 

xGR" 


(39a) 


£ 


subject to 

AjXj = b 

(39b) 


i=i 


Xi 






, X2 := 


, ... X£ := 


_Xp_ 


X2p 


Xn 


2p 


fi(xi) := f2(x2) := ^ /*(xi), 

i—p+1 


2=1 


f^x^) := ^ f,{xi) 

i={£-l)p+l 


(40) 


and 


Ai [Ai ... Ap\ , A 2 [Ap+i ... A2p] , ... Af.— ... A„] . 

Notice that A = [Ai,A 2 , _4^] = [Ai, A 2 , ..., A„], and x = [xf ,..., xj]"^ = [xf,..., x^]^. Furthermore, it 

will be useful to define the index sets 


5i ={!,...,p}, S2 = {p+l,...,2p}, ... Se = {{£-l)p+l,...,n} (41) 

associated with the partition described above, and to use the notation Sij to denote the ji\i element of Si. 

We now think of applying Algorithm^to the £ groups of data. That is, in Step 3 of Algorithm^we have 
£ minimization problems, one for each of the grouped data points Xj (rather than n minimization problems, 
one for each of the individual data blocks x^)- For the grouped data, we require regularization matrices 
7^1,..., Vi, for each of the £ groups; these matrices will be crucial in our upcoming derivation. 

To motivate the idea of “grouped data”, and to make the ideas that will be discussed in this rest of this 
section more concrete, we give a specific example that shows how our hybrid algorithm will work. 

Example 9. Suppose there are n = 12 blocks and we have access to a parallel computer withp = 4 processors. 
We make a formal partition of the data into £ = 3 groups. That is, we set fi(xi) = f 2 (x 2 ) = 

andisi^s) = = \xj,...,xlf, xa = \xl,...,xlf, and = [x|’, ..., 
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partition the matrix A accordingly, and initialize index sets Si = {1,...,4}, S 2 = {5,..., 8} and S 3 = 
{9,..., 12}. Then, a single iteration of H-ADMM (see Steps 3-7 of Algorithmic will run in the following 
way. The Lagrange multiplier estimate y^^'> and (group) variables and are fixed. Group variable 
Xi is updated by solving a subproblem of the form (EH) for each of xi,..., 3:4 in parallel. This gives the new 
point . Then, and Xg*^ are fixed, and four subproblems of the form (E21) are solved for each of 

X5,... jXg in parallel, giving Next, and are fixed, and four subproblems of the form 

(EH) are solved for each of xg,... ,xi 2 in parallel, giving Xg^'*’^^. Finally, is updated in ESD- 

Example [5] shows that Algorithm [5] is running a Gauss-Seidel process on the group variables, but run¬ 
ning a Jacobi process to update the individual blocks within each group. This example shows an efficient 
implementation in the sense that, by ensuring that the group size p matches the number of processors, all 
processors are always engaged, and that updated information is utilized when it is available. 

In the rest of this section we explain how H-ADMM (Algorithm [6]) is obtained from F-ADMM. 


3.2 Separability Via Regularization 

We show that, if the regularization matrices are chosen appropriately, F-ADMM can be partially 

parallelized, and forms the hybrid algorithm H-ADMM. In particular, for the ith subproblem in F-ADMM 
(applied to the grouped data in ESI); the p blocks within the ith group can be solved for in parallel. 

In what follows, we use the relationships 


II ^ ^ j 11 2 ^ ^ ^3 ) 'y 4l;X;) and 


p p 

jXSij,As,^iXSij), 

j€Si l€.Si j — 1 1 = 1 

1^3 ’■7^3 


(42) 

(43) 


which can easily be verified. Using the definition of Ai and a similar reasoning as for (l42|l . it follows that 


ii- 4 ix,i |2 = II AjXjWi = 11^32^3112 + '^(^ 3 ^ 3 ^ 

3&Si jGSi j&Si i€Si 

1^3 


(44) 


We now define := 6 — X)q=i ~ J2i=i+i and notice that is fixed when minimizing the 

augmented Lagrangian with respect to group x^. Recalling Algorithm [S] and (IT^ . and using EH; the update 
for the ith subproblem for our grouped data problem without the regularization term is equivalent to 


(k+i) 


■ / • r ( (^+1) (^+1) (^) 

x; '=argmm£p(x( ',... ,x)_;^ %Xi,x)/^,... ,x 


y(fc)) 


= argrnin |f*(xj) - ,A^yii - b^) -|- ^||Axi - b^ll^j 
= argrnin |fi(xi) - - b*) -1- ^\\hi\\l + ^HAxiHa “ p(Axj,bj)| 

= argrnin - {y^''\A^^^ - b,) - p(AiXi,bi) + | XI H^f^^ill^ + f X X^^'2:z, AjXj)|. (45) 

w 


Notice that it is the final term in (I45|) that makes the minimization of the augmented Lagrangian (with 
respect to the group x^) non-separable; it contains a cross product term, which shows interaction between 
different blocks of variables within the fth group indexed by Si. 
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3.2.1 Defining the group regularization matrices 

We eliminate the non-separability in (HSl) by carefully choosing the regularization matrices {ViYi^i- From a 
practical perspective, if the problem is made separable, then the individual blocks within the fth group can 
be updated in parallel. To this end, we choose the matrix that defines our regularizer to be 


:= 


PSi,i 

PSi,2 




-P^S.i^Si, 




(46) 


We remind the reader that the matrices {Pst j Yj=i used to define Vi are user defined symmetric matrices that 
must be chosen to be sufficiently positive definite, to ensure that convergence of F-ADMM on the grouped 
data is guaranteed. Before we formalize our assumption, we require the definitions 


Aa '■= 


A 2 ... Ai 


Ai 

Ai 

and Ad ■= 

Ai_ 


where {Aa,Ad} C We then have the strictly (block) upper triangular matrix 


(47) 


AIA2 ... AjAt ' 


A HI A A — 


Aj_,Ai 


€ R 


NxN 


(48) 


Notice that the definitions of Ad, Aa and AJ)Aa in (|47ll and (l48)l . are analogues to Ad, Aa and defined 

in dm) and (HU). We are now ready to state our assumption on {Vi}Yi, which is actually Assumption 2] 
applied to the grouped data problem (|5^ . 

Assumption 10. The matrices Vi are symmetric and satisfy Vi >- for all i = 1, ... ,i- 

Importantly, if F-ADMM is applied to the grouped data problem (l39l) and Assumption ITOl holds, then 
convergence is automatic, i.e., convergence of F-ADMM equipped with Assumption 1101 applied to problem 
(l39l) follows directly from the convergence results presented in Section [2j 

3.2.2 Incorporating the regularization term 

Now that the regularization matrices {Vi}Yi are defined, we return to the non-separability encountered in 
(H5|) . Recall that the subproblem in Step 3 of F-ADMM (Algorithm [S|) is equivalent to (fT51) . which in turn 
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is equivalent to (l4^ + ^||xi — We concentrate on the regularization term, and notice that 


-Xj T’iXj 


goJ+iIlT} 1 




Ps.„xs,„ - pAl^ Jjy,=i As,,,xs,,^ 

^ j^p ' 


P P 




i=i 


j — l 1 = 1 


Hsl 1 




3^Si 


Following a similar argument, we can write 


n 2 




(49) 


xfV,x\’^> =J2^s,jPsijxfl - P^^{As,,,xs,,,,As,^,x^sl) 


i=i 


j — l 1 = 1 
1^3 


= H ^JP3^j -pJ 2 ^^ 3 ^ 3 )- 


3 eSi 




We may now use M and dSni) to write 


511 ==. - = 5 E ii='jiit, - E + 5 E 

je5i je5i j&Si 

P 


Xj IIP,- 


(50) 


+ pJ2 ^i^‘^ 1 ^^ 3 ^ 3 ) ~^J2 ^{AixuAjXj) 

j^Si l&Sj^ 

I ¥^3 1¥3 1¥3 


This may be equivalently written as 


1, 


- Xi - x,^ 


( fe ) ||2 


* ~ \\Vi 


1 


- 5 I: 


2 ''^3 ^3 llPj 

3&Si 3&Si leSi 

1¥3 


(k) ||2 


+ pY -^Y Y^^^^^'^3^3) -^Y Y^^‘^l’^ 3 ^j)- 


jGSi l€.Si 


j^Si iGSj^ 


By adding this regularization term, i.e., ^Hx^ — x^^ |||,., to the objective function in (1^ . we obtain (ignoring 
terms independent of x^) the F-ADMM update 


(k+i) 


argrnin|fi(xi) - {y^'‘\AiXi - bi) - p{AiXi,hi) 


+ 2'^ lia + pY '^3^3) + \Y ^ ll^i 

j&Si jeSi ieSi j&Si 

1¥3 


}■ 


which is equivalent (again ignoring constant terms) to 

xf+^^ = argmm^ |/j(a;j) - {y^'^\AjXj) - p{AjXj,hi) + ^\\AjXj\\l + ^ ^3^3) + \\\x3 “ a^f^llp,} 


3^Si 


iGSi 

l^J 


(^) 1 

argmin^ \^fj{xj) + ^\\AjXj + Y “ b* - ^||1 + ^IIp,}- 


(51) 


3^Si 


leSi 

1¥3 
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The regularization matrix Vi, defined in (H51) . has caused the cross-product term to be eliminated from (HSl) 
(recall that (l45l) was the update without using the regularization term), and subsequently the subproblem 
for updating is separable into p blocks (one solve for each j G Si). That is, the decision variables 

Xj for j G Si can be solved for in parallel. This updating strategy forms our hybrid algorithm H-ADMM, 
which is able to use a combination of both Jacobi and Gauss-Seidel updates. We emphasize that H-ADMM 
is a special case of Algorithm [Sj where the blocks of variables have been (implicitly) grouped together as in 
(l39)) . and the regularization matrices have the form (l46l) . 

3.2.3 The H-ADMM Algorithm 

The following is a formal statement of our H-ADMM algorithm. Recall that H-ADMM is a special case of 
F-ADMM, and convergence of H-ADMM follows directly from the convergence theory for F-ADMM. 


Algorithm 6 H-ADMM for solving problem ([T]). 

1 : Initialize: x G R^, G R”", iteration counter k = 0, parameters p > 0 and 7 G (0,2), data 
partition index sets {Si}^^^, and regularization matrices satisfying Assumption [Till 

2 : while stopping condition has not been met do 
3: for i = 1,..., f in a Gauss-Seidel fashion solve do 

4: Set hi ^b- “ Y)i=i +1 . 

5: for j G Si (in parallel) do 


,(fc+i) 

'3 


•<— argmin 

Xj 


3&Si 


+ 


leSj 



1 

2 




(52) 


6 : end for 

7: end for 

8 ; Update the dual variables: 

y(fc+l) ^ y{k) _ r^p(^Ax^k+l) _ (53) 

9: Set fc •<— fc J- 1. 

10 : end while 


The groups of data are updated in a Gauss-Seidel scheme (see the for loop in Step [3]), while the individual 
blocks within each group are updated in a Jacobi (parallel) scheme (see the inner for loop in Step[5|). 

We have presented H-ADMM as a (serial) Gauss-Seidel algorithm that has an inner loop which can be 
executed in parallel, i.e., H-ADMM is partially parallel. However, we can also view H-ADMM as a fully 
parallel method that occasionally inserts updated information during the update from x*^^) to . This 

shows that H-ADMM is extremely flexible. 

Remark 11. Notice that the regularization matrix (1461) is not explicitly formed in H-ADMM (Algorithm\^. 
Specifically, H-ADMM only uses the matrices Pg. 1 ,... ,Psi p, which lie on the main (block) diagonal of Vi. 
Therefore, as for F-ADMM, only one Ni x Ni regularization matrix Pi is required by H-ADMM for each of 
the i = 1,... ,n (individual) blocks. 

Remark 12. The update ()52D in H-ADMM has the same form as the update (I8a|) in J-ADMM (Algorithm^ 
for all j G Si. Therefore, a single iteration of H-ADMM has essentially the same cost as that of J-ADMM. 
However, H-ADMM has the advantage of using the most recent updates when updating the groups of data. 

3.3 Computational Considerations 

Parallel algorithms are imperative on modern computer architectures, which is why, at face value, Jacobi-type 
methods seem to have significant advantages over Gauss-Seidel-type competitors. The H-ADMM (Algorithm 
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ED bridges the gap between purely Jacobi or purely Gauss-Seidel updates, finding a balance between ensuring 
algorithm speed via parallelization and allowing up-to-date information to be fed back into the algorithm. In 
this section we describe how to choose the number of groups I and group size p to “optimize” H-ADMM from 
a computational perspective. Moreover, we show that H-ADMM is competitive compared with J-ADMM. 

Consider a big data application where the number of blocks n is very large. Moreover, suppose we have 
access to a parallel machine with p processors, where p < n (or even p n). Again we will assume that 
n = ip, and the n blocks are organized into i groups of p blocks. We stress that the number of blocks in each 
group is the same as the number of processors. 

To implement H-ADMM we first initialize bi. Then, take the first group of p blocks and send one block 
to each of the p processors. These p blocks are updated in parallel as in (15^ (Step 5 of Algorithm E]) . 
Once these p blocks have been updated, we have the updated group variable consisting of individual 

blocks ..., We then form b 2 as in Step 4 of AlgorithmEl Notice that b 2 incorporates the new 

information from the updated block via the term i.e., we feed the updated information back 

into the algorithm. Now, the next group of p blocks are sent to the p processors to be updated, giving x^ 
consisting of individual blocks This new information is then fed back into H-ADMM via 

the vector ba. The process is repeated until a full sweep of the data has been completed, i.e., all n blocks 
have been updated. 

In this way, our H-ADMM algorithm has (essentially) the same computational cost as J-ADMM, because 
the data blocks have been grouped in an intelligent way that takes advantage of the processors available. (For 
J-ADMM, the data blocks also need to be sent to processors in groups of p, it is just that, for J-ADMM, there 
is no need to update the vector b^ between the £ sweeps of the processors.) We note that for J-ADMM, the 
matrix-vector multiplication is computed once all n blocks of x have been updated (i.e., once 

is available), whereas for H-ADMM, the computation of has been split and performed in stages 

with the vectors (for i = !,...,£) computed after each group of data has been updated and the 

sum taken just before the dual variables are updated. Again, this shows that H-ADMM and J-ADMM have 
approximately the same computational cost, but H-ADMM has the advantage of new information becoming 
available to the algorithm, which has the potential for H-ADMM to be more efficient. 

Remark 13. Notice that if n <p, then H-ADMM is essentially equivalent to J-ADMM (applied to strongly 
convex functions) if we take £ = 1 and replace Stepl^ with bi = b. 


3.3.1 An efficient implementation of Steps 4 6 in Algorithm [6] 

Algorithm El was written to match our presentation in the text. However, in practice, it is computationally 
advantageous to perform Steps 4-6 in a different, but equivalent way. To that end, consider the middle term 
in the minimization subproblem (I52|l . and using the definition of b^ (Step 4 in AlgorithmEl we have 


(k) 

iGSi ^ q—1 5=^+1 

ijj ijj 


(fe) 




II 


(k) 

x) 

+ E + 

i-1 

/ > 

-b ^ 

,{k) 

n 




leSi 

q^l 

s=i+l 

P 

II 

~ 

(k) 

x) 

+ Axf> + E 

A _|_ 

-^q-^q 1 






9=1 

s 

=*+i P 


II 

~ 

(k) 

x) 

9=1 

+ ^AxW 

s=i 

-b-y^. 

p 

(54) 

(IM)) are 

fixed 

with respect to j G 

Si, so we can combine them into a 

single 


vector Vi say, and rewrite Steps 4-6 in Algorithm El as follows. 
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Algorithm 7 An efficient implementation to replace Steps 4-6 in H-ADMM. 


1: Set ^ 4 

2: for j € Si (in parallel) do 




x\ 


Xk+1) 


(55) 


3: end for 


3.3.2 Practical considerations regarding Assumption ITOl 

As discussed in Section [3.2.11 choosing the regularization matrices to have the form for i = 1,..., € in 
a manner that satisfies Assumption (TUI ensures that H-ADMM is globally convergent. However, we have 
remarked that an implementation of H-ADMM only needs the individual (diagonal) block matrices {Pj}'j=x- 
The purpose of this section is to translate Assumntion llOl which is an assumption on the group regularization 
matrices into a practical condition on the matrices {PjlJLi- 

To this end, recall the definition of Vt in (l46l) . If we define 


vF ■■= 


PSi,i + 


Psi,„ + pAg A5 


(56) 


then Assumption [TUI can be written equivalently as Vi = VF — pAf Ai 4 ^Wj^AaW^I ^ which holds if and 

only if VF > pAfAi + ^W^^A^WiP Using pHAilH/ ^ pAFAi, a sufficient condition for Assumption [TOl to 
hold is that 

vF > p\\A,\\ll + Y^\\AlA^\\ll. (57) 

It then follows from the definition of VF that (l57l) will hold (equivalently. Assumption [10] will be satisfied) 
if the matrices {Pj}j^Si are chosen to satisfy 



Vj 


That is, if (l58l) is satisfied for all 1 < i < .^, then H-ADMM is globally convergent. 


(58) 


3.4 A Special Case for Convex Functions 

In this section, we describe how our hybrid algorithm is appropriate for convex functions (i.e., we do not 
need strong convexity) in the case of 2 groups. In particular, it is based on Algorithm [31 which was first 
introduced in [5] and shown to be globally convergent if the regularization matrices Pi and P 2 in ([7]) are 
chosen appropriately. In particular, the convergence theory introduced in [5] holds when Pi 4 0 and P 2 4 0. 

During the remainder of this section, we demonstrate that by choosing the regularization matrix appro¬ 
priately, Algorithm [3] can be extended to handle the n block case while maintaining all existing convergence 
theory. This is done by following the hybridization scheme introduced previously in this section. 

So, suppose that we have an optimization problem of the form o, and that we partition the n blocks 
into 2 groups, i.e., we have 1 = 2 groups and, for simplicity, assume that p = n/20 We can then equivalently 

'^From a practical perspective it is sensible to let the first group have a cardinality that is a multiple of the number of 
processors, and then the second group would contain the remaining blocks. 
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write our problem in the form (I39p where 


and 



Xl 


a^n/2+1 

Xl = 


, X2 = 



J^nj2_ 


Xn 


nil 

fl(Xl) = '^fi{Xi), 




n 

f2(x2) = 

z—n/2+1 


(59a) 


(59b) 


Ai = [Ai ... A„/ 2 ] and A 2 = [^n/ 2+1 • ■ • A^] . (59c) 

It is clear that we can apply Algorithm [3] to the grouped data (l5^ . If we now choose the regularization 
matrices Vi and V 2 to have the same form as in (1461) . we have 


Pi —pAjA2 ... — 

~P-^Ai P2 '. 

~P^n/2^i . Pn/2 


(60) 


and 


Pn/2+1 -P^^/2+l"4n/2+2 

~P^Z/2+2^n./2+l Pn/2+2 

- —P-^Z-^n/ 2+1 



(61) 


Moreover, by letting 



Pi + pAf^Ai 


Pn/2+1 + P^^/ 2 +l^n/ 2 +l 

II 


and V2 = 



Pn/2 + P^n/2^rii2_ 


Pn + pA^An_ 


we have that Vi = Vi — pA^Ai and V 2 = P 2 — pA^A 2 . Thus, a sufficient condition for the matrices Vi 
and V 2 to be positive definite is that 

V?>-p\\Ai\\ll and 7^2^ ^pM 2 ||^/. (62) 

We can then see that a sufficient condition for (15^ to hold is to choose the matrices {Pj)'j^i to satisfy 

Pj >■ pllAill^d for 1 < j < n/2, and Pj >- p||A 2 || 2 ^ for n/2 + 1 < j < n. (63) 

In summary, if the matrices are chosen to satisfy (l63l) . then Algorithm [8] is guaranteed to converge 

for convex functions. We also comment that Pi and P 2 in (1601) and (1611) need not be formed explicitly, since 
Algorithm [5] only requires the block diagonal regularization matrices be chosen to satisfy (1551) . 
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Algorithm 8 H-ADMM(^ = 2) for solving problem ([T]) with a convex objective. 

1 : Initialization: € R^, € R"*, iteration counter k = 0, parameters p > 0 and 7 S (0,2), and 

regularization matrices 

2: while the stopping condition has not been met do 
3: Set Vi = X;Li ^ ^ 

4; for j G {1, , n/2} in parallel do 




G- 


{fjiXj) + - xf^) + Ui| 


1 

2 



5 

6 
7 


end for 


Set V 2 = Aix) 


(fe+i) 




for j G {n /2 + 1 ,..., n} in parallel do 


Jfc+i) 


<- 


lin l^fj(xj) + ^\\Aj{xj - xf'') + U 2 II 2 + 


a;,- — X 


(k) I 


Pi 


(64a) 


(64b) 


8 : end for 

9: Update the dual variables: 

yik+l) ^ y(k) _ ^p(^2.(fc+l) _ 

10 : Set fc ■«— fc + 1. 

11: end while 


Remark 14. An algorithm similar to Algorithm [21 is presented in USf . However, Algorithm [21 is more 
general because the only restriction on the matrices are that they are “positive definite enough”, 

i.e., they satisfy (ESI). On the other hand, the algorithm in requires the regularization matrices to take the 
specific form CipAjAi, where Ai has full rank and Ci> njl for all i = 1 ,... ,n (assuming that the individual 
blocks are partitioned evenly into 2 groups). These latter conditions are more restrictive, and also do not 
necessarily mean that the subproblems arising within their algorithm are easier to solve. For the example of 
li-minimization subject to equality constraints, the regularization matrices in Algorithm\^ can be 

chosen to have the form Pi = tjI — pAf Ai for some Tj, which means that subproblems (I64al) and (I64bl) can 
be solved using soft-thresholding. This is not possible for the algorithm presented in 1121) . For further details, 
see the numerical experiments in Section lJT^ 

4 Numerical Experiments 

In this section we present numerical experiments to demonstrate the computational performance of F-ADMM 
(Algorithm[5]) and H-ADMM (Algorithm[ 6 l), and compare them with J-ADMM [7]. All numerical experiments 
were conducted using Matlab on a PC with an Intel i5-3317U, 1.70GHz processor, and 6 Gb RAM. 

4.1 / 2 -Minimization with Linear Constraints 

In this numerical experiment, we consider the problem of determining the solution to an underdetermined 
system of equations with the smallest 2-norm. Specifically, we aim to solve 

mininiize -||a ::||2 subject to Ax = b. (65) 

We assume that there are p = 10 processors, and then divide the data into £ = 10 groups, each group 
containing p = 10 blocks, with each block of size Ni = 100, which results in A = 10"^ total variables. We also 
note that the objective function in is (block) separable and can be written as f{x) = Efci with 
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n = 100 and Xi € Ni and fi is strongly convex with convexity parameter = 1 for all 1 < i < 100. The 
constraint matrix A € with TO = 3 • 10^ is chosen to be sparse, with approximately 20 nonzeros per 

row, where the nonzeros are taken from a Gaussian distribution. To ensure that b G range(A), we randomly 
generate a vector z G with Gaussian entries, and set b := Az so that the constraints in (IS51) are feasible. 

Notice that for problem (1551) . the subproblem for the jth block of x in H-ADMM can be written as 


Jk+l) 

■^3 


G- 


argmin{i||xj||^ + 


{k) \ 

'.A — X\ ’ 


- Xj ' j + Vi 






( 66 ) 


for j G Si, where Vi is defined in Step [T] of Algorithm [T] (For F-ADMM the subproblems are solved for all 
1 < j < n.) Notice that the update can be found by solving the system of equations 


{Pj +1 + pAj A, )a;f= (P, + pAjA,)xf^ - pAjv, . (67) 

This linear system motivates us to choose the regularization matrix Pj to be of the form 

P,=T,I-pAjA, ( 68 ) 

for some value tj, because it may then be combined with (155)) to give the simple and inexpensive update 


Jk+l) 

■^3 





(69) 


For computational reasons, the fraction Xjlijj + 1) should be computed before multiplication with Xj’^\ 

For all of the numerical results reported below, we give the number of epochs required by J-ADMM, 
F-ADMM, and H-ADMM, and use algorithm parameters p = 0.1 and 7 = 1 . The terminology “epoch” 
refers to one sweep of the data, i.e. that all n blocks of x are updated once. All reported results are averages 
over one hundred runs. The stopping condition used in all experiments is ^||Ax — &||| < 10“^°. 


4.1.1 Results using the values of t that satisfy the theory 

In this section, we give the results of our numerical experiments when Tj that defines Pj in (1681) is chosen as 
dictated by theory. Specifically, we have 

• F-ADMM: Tj = IIA|)Aa II 2 + p||Aj||| for all 1 < j < n (see Assumption H]) 

• H-ADMM: Tj = 2 ; 7 ||ADAa ||2 + Pll-^illi all j G Si and 1 < i < £ (see Assumption fTOl and (l58l) l 

• J-ADMM: Tj = p{n — l)||Aj||| for all 1 < j < n (see [7]) 

for the three methods. To get a sense of the size of these choices for Tj, we plot their magnitudes in Figure)!] 
The x-axis represents the block number and the y-axis the value of Tj. For example, a blue point at the 
value (20,180) means that r 2 o = 180. We can clearly see that the Tj values are much smaller for H-ADMM 
and F-ADMM, than for J-ADMM. Moreover, the Tj values used for F-ADMM and H-ADMM are similar in 
magnitude. This is, perhaps, a disadvantage since a large value for Tj translates into stronger regularization 
in each subproblem (1661) . which in turn translates into smaller steps and potentially slower convergence. For 
our test problem (|551) . this turns out to be the case, as we now discuss. 
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Figure 1: A plot of the magnitude of Ti for each block 1 < i < n for problem (1551) . 

In Table[I] we present the number of epochs needed by each method (averaged over 100 runs). They show 
that H-ADMM and F-ADMM require significantly fewer epochs than J-ADMM to determine the solution 
of problem (ESI) when the theoretical values of Tj are chosen. As discussed in the previous paragraph, we 
can see that the larger values for tj needed by J-ADMM lead to poor numerical performance compared with 
F-ADMM and H-ADMM. However, we remind the reader that H-ADMM and J-ADMM are essentially the 
same cost per epoch (see Remark[T2]), while F-ADMM is generally more costly due to its sequential nature. 
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4358.2 

214.1 

211.3 


Table 1: We present the number of epochs required by J-ADMM, F-ADMM, and H-ADMM for the h- 
minimization problem (|65p using theoretical values of tj for j = 1,..., n. 


4.1.2 Results using values for r obtained by parameter tuning 

In [7], it was mentioned that J-ADMM displays better practical performance for smaller values of tj than 

those required by the convergence theory. In this section, we compare the number of epochs required by 

H-ADMM, F-ADMM, and J-ADMM when tj is allowed to be obtained through parameter tuning. In this 

experiment, for simplicity, we assign the same value Tj for all blocks j = 1,..., n. (i.e., ti = T 2 = ■ ■ ■ = Tn-) 

2 

Moreover, we picked the starting value to be Tj = ^||A|p because it approximates the values of Tj given by 

theory, in the sense that: ||ADAa|P < ||A_d|P||Aa|P sa ||A|p||A|p. H 

Table [2] presents the number of epochs required by J-ADMM, F-ADMM, and H-ADMM on problem (l65l) 

as Tj varies. For each Tj we run each algorithm (J-ADMM, F-ADMM and H-ADMM) on 100 random instances 

of the problem formulation described in Section 14.11 It is clear that all algorithms require fewer epochs to 

satisfy the stopping tolerance as Tj decreases. Moreover, for fixed Tj, F-ADMM and H-ADMM require slightly 

fewer epochs than J-ADMM. Table[5]also shows that F-ADMM and H-ADMM will converge, in practice, for 

2 . 

smaller values of Tj than J-ADMM. In particular, J-ADMM diverged when we set Tj — 0.2 • ^ || A|p, whereas 

F-ADMM and H-ADMM converged for Tj as small as 0.1 • ^||A||'^; both diverged for Tj = 0.09 • ^IIAH"*. 
It is clear that, when the parameter Tj is tuned, F-ADMM and H-ADMM outperform J-ADMM, when 
performance is measured in terms of the number of epochs. 

^There are many other ways that parameter tuning could be implemented, and we have simply implemented one possibility. 
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J-ADMM 

H-ADMM 

F-ADMM 

4iiAr 

530.0 

526.3 

526.2 


324.0 

320.1 

319.9 

0 

217.7 

214.5 

214.1 

0 .22- 

123.1 

119.3 

119.0 

0.2 

— 

95.8 

95.5 

0.1 .^pr 

— 

75.3 

73.0 


Table 2: We present the number of epochs required by J-ADMM, F-ADMM, and H-ADMM for the l^- 
minimization problem (j65p for varying values of Tj. Here, Tj takes the same value for all blocks j = 1,..., n. 


4.2 /i-Minimization with Linear Constraints 

We now consider the problem of Zi-minimization subject to equality constraints as given by 


minimize ||a;||i subject to Ax = 6, 

X 


(70) 


which arises frequently in the compressed sensing and machine learning literature. The one norm promotes 
sparse solutions, while the linear constraints ensure data fidelity. Note that the one norm is separable. 

Problem (170)1 is convex and not strongly convex, which means that H-ADMM(£ = 2) (Algorithm |S]) 
is guaranteed to converge, while convergence for F-ADMM and H-ADMM has not yet been established. 
Nonetheless, we include them in the numerical experiments to study their practical performance. 

For ease of comparison, we follow the experiment setup given in [7]. In particular, suppose that the data 
is partitioned into n = 100 blocks of size Ni = 10 for all 1 < i < n, so that N = ~ 1000- We 

suppose that A = [Ai,..., A„] is randomly generated with Gaussian entries, and that Ai S R™ xNi each 
I < i < n and m = 300, which means that A € The sparse signal x* has A: = 60 randomly located 

nonzero entries, the nonzero entries are Gaussian, and the vector b is defined by b := Ax*. 

For every algorithm we set 7=1 and p = 10/||&||i. For H-ADMM we let n = p£ with p = 4 and 
£ = 25, and for H-ADMM(.^ = 2) we set £ = 2 with both groups containing p = 50 blocks. All algorithms 
were terminated when ||x — a:* || 2 /||a:* II 2 < 10“^°. We report on the number of epochs (as in the previous 
section) and the final constraint residual where r = Ax — b for J-ADMM, F-ADMM, H-ADMM, and 

H-ADMM(£ = 2). All reported results are averages over 100 runs. 

For problem (17(111 . the subproblem for the jth block of x in H-ADMM can be written as 


Xk+i) 

'j 


argmin |\\xj || 1 -b | \\Aj{xj - xf ^) + Ui||^ H- 


IXo- — X 


(fc) I 


(71) 


for j € Si, where Vi is defined in Step [T] of Algorithm [T] (For F-ADMM the subproblems are solved for all 
^ ^ j S: n.) Next, using a similar choice for Pj as given by (1681) . the latter two terms in dm) become 


^\\Aj{xj -xf^)+v^\\l + 

„(D^T aT 


-\\Xj -xf^Wl. 


= - x'f>y'AfAj{xj - x''^’) + p{xj - xf^fAjvi + ^{Xj - x''^’YPj{xj 

= \ixj - xf^YiPj + pAjAj){x3 - xf^) + p{xj - xf'^fAjvi -b ^\Y\\l 
= - xf'^) + p{xj - xf^fAjvi + ^\\vi\\l 


1 


= D - djWl - 


2 
J 112 



||2 

II 2 
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where we have defined dj := — -^A^Vi to derive the last equality. Using the previous equality, the 

solution to subproblem (ED is the same as that given by 

^ argminj —llxjlli + i|la;j(72) 

Xj KTj L J 

which is separable, so that soft thresholding can be used to solve for 

4.2.1 Results using the values of t that satisfy the theory 

Here we present the results of the above stated experiment setup when the values of tj required by the theory 
are used. We recall that the convergence theory for F-ADMM and H-ADMM has not been established in 
the convex case, so here we simply use the values of tj that are needed in the strongly convex case. We 
also recall that convergence of H-ADMM(.^ = 2) is guaranteed in the convex case (see Section [3^ . Thus, 
in addition to the tj values for F-ADMM, H-ADMM, and J-ADMM given in Section 14.1.11 we also use 

• H-ADMM(f = 2): Tj = p||Ai||| for all j G Si and 1 < i < 2 (see (|M|)). 

Figure [5] shows typical Tj values for each algorithm for the ^i-minimization experiment. (For the meaning 
of each plotted point, see Section [4.1.11 1 As it was for the .^ 2 -miiiimization problem (1551) . we again see that 
the Tj values are much smaller for F-ADMM and H-ADMM, when compared to J-ADMM. In addition, the 
Tj values for H-ADMM(£ = 2) are the smallest overall. Consequently, the regularization matrices used in 
H-ADMM(£ = 2) are significantly less positive definite than all other algorithms, and the regularization 
matrices for F-ADMM and H-ADMM are less positive definite than for J-ADMM. 
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Figure 2: A plot of the magnitude of Ti for each block 1 < i < n for problem ED- 

In Table |3] we give the number of epochs required by each algorithm, as well as the final constraint 
residual ||lr|||, where r = Ax — b. Table |3] shows that H-ADMM(£ = 2) requires far fewer epochs than the 
other algorithm, with F-ADMM and H-ADMM requiring about one-sixth the number of epochs compared 
with J-ADMM, for the Tj values stated above. Although there is no convergence theory for F-ADMM and 
H-ADMM, they both converge in practice for this setup, and are very competitive with J-ADMM. 

4.2.2 Results using values for r obtained by parameter tuning 

While theory dictates the values of Tj needed to guarantee convergence, experimental performance can often 
be improved by selecting better parameter values. In this section, we compare the performance of the 
algorithms from the previous section using smaller values of Tj than those used in Section 14.2.11 We repeat 
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J-ADMM 

H-ADMM(£ = 2) 

H-ADMM 

F-ADMM 

Epochs 


Epochs 


Epochs 


Epochs 


12610.1 

0.27e-16 

605.8 

0.19e-16 

1882.1 

0.24e-16 

1879.4 

0.24e-16 


Table 3: We present the number of epochs required and final constraint violation by J-ADMM, F-ADMM, 
H-ADMM, and H-ADMM(^ = 2) for the Zi-minimization problem (1701) for varying values of Tj. 


the experiments described in Section 14.21 but now use the same value Tj for all blocks j = 1,... ,n and for 
all algorithms. The results are presented in Table H) 

Table U shows that for the li-minimization problem, the number of epochs needed by each of the algo¬ 
rithms to reach the stopping tolerance decreases as Tj decreases. Also, for each fixed Tj, J-ADMM requires 
the most epochs followed by H-ADMM(£ = 2) and H-ADMM, while F-ADMM requires the smallest number 
of epochs. This makes intuitive sense because, during every epoch, F-ADMM incorporates new information 
after every block has been updated, H-ADMM incorporates new information after every group of p = 4 
blocks have been updated, H-ADMM(.^ = 2) only incorporates new information after half of the blocks have 
been updated {p = nf 2 = 50), while J-ADMM does not use any updated information within each epoch. 



J-ADMM 

H-ADMM(£ = 2) 

H-ADMM 

E-ADMM 

D 

Epochs 


Epochs 


Epochs 


Epochs 


0.2 

1078.3 

0.23e-16 

1066.2 

0.23e-16 

1051.3 

0.21e-16 

1053.3 

0.21e-16 

0.1 

600.3 

0.19e-16 

581.5 

0.19e-16 

570.0 

0.20e-16 

567.8 

0.22e-16 

0.05- 4Pf 

344.4 

0.19e-16 

328.8 

0.20e-16 

325.9 

0.21e-16 

326.0 

0.21e-16 

0.03- 4pii'‘ 

— 

inf 

— 

inf 

246.3 

0.16e-16 

244.4 

0.15e-16 

0.02- 4Pf 

— 

inf 

— 

inf 

162.2 

0.34e-16 

150.3 

0.29e-16 


Table 4: We present the number of epochs required and final constraint violation by J-ADMM, F-ADMM, 
H-ADMM, and H-ADMM(^ = 2) for the Zi-minimization problem (l70l) for varying values of Tj. 

Next, we can also see that J-ADMM and H-ADMM(£ = 2) perform well until Tj = 0.05-^ || A|||. However, 
for smaller values of Tj, J-ADMM and H-ADMM(t' = 2) diverged. On the other hand, F-ADMM and H- 

ADMM still converge (in practice) for Tj = 0.02'^ || Ajj^, but diverged when r = OO.IO-y ||A|| 2 . Thus, we can 
conclude that if the parameter Tj is hand-tuned for each algorithm, then practical performance is greatly 
improved for all methods, and that both F-ADMM and H-ADMM perform the best in practice on this 
convex optimization problem. 

Remark 15. An adaptive parameter tuning scheme is presented in Section 2.3], which ensures that the 
convergence theory developed for J-ADMM still holds, i.e., convergence of J-ADMM is guaranteed if their 
adaptive parameter tuning scheme is followed. Unfortunately, we were unable to replicate the numerical 
results presented in that paper, because there was not enough information regarding the tuning parameters 
that they used. However, we implemented the adaptive parameter tuning scheme for J-ADMM using the 
following parameters: rj = 0.1, ai = 1.1, j3i = 0.1, Qi = I, for all i = 1,... ,n, and on average over 100 runs 
on the li-minimization experiment, J-ADMM reguired 442.6 epochs. This is more than the ^ 220 epochs 
reported in that paper. In either case, by hand tuning Tj, F-ADMM, H-ADMM, and H-ADMM(£ = 2) all 
outperform J-ADMM. 

Remark 16. Following the same ideas as in 0 Section 2.3], it may be possible to develop adaptive parameter 
updating schemes for F-ADMM and H-ADMM that still ensure convergence. In this way, it may be possible 
to achieve additional computational gains for both of them. 
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5 Conclusion 


We presented an algorithm for minimizing block-separable strongly convex objective functions subject to 
linear equality constraints. Our method, called F-ADMM, may be viewed as a flexible version of the popular 
ADMM algorithm. In particular, F-ADMM is provably convergent for any number of blocks, and contains 
popular methods such as ADMM, J-ADMM, and G-ADMM as special cases. 

Our work was motived by big data applications. We showed, via numerical experiments, that F-ADMM 
is especially effective when the number of blocks is larger than the number of available machines. In this 
case, unlike Jacobi methods, our method allows for updated variables to be used when updating the blocks 
within subsequent groups, all while maintaining essentially the same cost of a fully Jacobi method. Our 
numerical experiments indicate that this approach is more efhcient and stable than the fully Jacobi method. 
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